Tibetan Word Segmentation as Syllable Tagging Using Conditional Random Field
نویسندگان
چکیده
In this paper, we proposed a novel approach for Tibetan word segmentation using the conditional random field. We reformulate the segmentation as a syllable tagging problem. The approach labels each syllable with a word-internal position tag, and combines syllable(s) into words according to their tags. As there is no public available Tibetan word segmentation corpus, the training corpus is generated by another segmenter which has an F-score of 96.94% on the test set. Two feature template sets namely TMPT-6 and TMPT-10 are used and compared, and the result shows that the former is better. Experiments also show that larger training set improves the performance significantly. Trained on a set of 131,903 sentences, the segmenter achieves an F-score of 95.12% on the test set of 1,000 sentences.
منابع مشابه
Tibetan Syllable-Based Functional Chunk Boundary Identification
Tibetan syntactic functional chunk parsing is aimed at identifying syntactic constituents of Tibetan sentences. In this paper, based on the Tibetan syntactic functional chunk description system, we propose a method which puts syllables in groups instead of word segmentation and tagging and use the Conditional Random Fields (CRFs) to identify the functional chunk boundary of a sentence. Accordin...
متن کاملPart of Speech Tagging for Amharic using Conditional Random Fields
We applied Conditional Random Fields (CRFs) to the tasks of Amharic word segmentation and POS tagging using a small annotated corpus of 1000 words. Given the size of the data and the large number of unknown words in the test corpus (80%), an accuracy of 84% for Amharic word segmentation and 74% for POS tagging is encouraging, indicating the applicability of CRFs for a morphologically complex la...
متن کاملUsing Part-of-Speech Reranking to Improve Chinese Word Segmentation
Chinese word segmentation and Part-ofSpeech (POS) tagging have been commonly considered as two separated tasks. In this paper, we present a system that performs Chinese word segmentation and POS tagging simultaneously. We train a segmenter and a tagger model separately based on linear-chain Conditional Random Fields (CRF), using lexical, morphological and semantic features. We propose an approx...
متن کاملCombining Character-Based and Subsequence-Based Tagging for Chinese Word Segmentation
Chinese word segmentation is the initial step for Chinese information processing. The performance of Chinese word segmentation has been greatly improved by character-based approaches in recent years. This approach treats Chinese word segmentation as a character-wordposition-tagging problem. With the help of powerful sequence tagging model, character-based method quickly rose as a mainstream tec...
متن کاملAutomatic segmentation of English words using phonotactic and syllable information
It is difficult to demonstrate the effectiveness of prosodic features in automatic word recognition. Recently, we applied the suprasegmental concept and proposed an extra layer of acoustic modeling with syllables. Nevertheless, there is a mismatch between the syllable and the word units and that makes subsequent steps after acoustic modeling difficult. In this study, we explore English word seg...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2011